Compression in Spark

Data compression is the process of encoding data using fewer bits than the original representation while still allowing the original data to be reconstructed. Compressing data can save storage capacity, speed up file transfers, and reduce costs for storage hardware and network bandwidth.

The design of a data compression scheme involves trade-offs among several factors: the degree of compression, the amount of distortion introduced, and the computational resources required for compression and decompression.

In Spark 2.4 / 3.0

The supported compression codecs are listed below, followed by a short configuration sketch:

  • none, 
  • uncompressed, 
  • snappy, 
  • gzip, 
  • lzo, 
  • brotli, 
  • lz4, 
  • zstd 
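
For example, the codec can be chosen either session-wide through a SQL configuration or per write through the DataFrameWriter "compression" option. A minimal sketch, assuming a SparkSession named spark and a DataFrame df (both illustrative; the output path is made up):

// Session-wide default codecs for Parquet and ORC output:
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.conf.set("spark.sql.orc.compression.codec", "zlib")

// Or per write, overriding the session default:
df.write.option("compression", "gzip").parquet("/tmp/df_gzip_parq")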

How compression works

Compression is performed by a program that uses a formula or algorithm to determine how to shrink the size of the data. Data compression may be lossy or lossless.

Lossless compression

Lossless compression enables a file to be restored to its original state, without the loss of a single bit of data, when it is decompressed. It works by eliminating data redundancy. Data redundancy is a condition in which the same piece of data is held in multiple places within a database or data storage environment. By eliminating redundancy, you are left with just one instance of each piece of data.

For example, data containing “AAAAABBBB” can be compressed into “5A4B”. This type of compression is called “run-length encoding” because each “run” of a repeated character is replaced by a count and the character. In the example above, there are two runs: one of 5 A’s and another of 4 B’s.

However, run-length encoding only works well on long runs of the same value. If your data contains “ABBAABAAB”, run-length encoding yields “1A2B2A1B2A1B”, which is longer than the original. In that case another method can be used: counting how often each value occurs in the whole data package, called frequency compression.
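
Before moving on to frequency compression, here is a minimal run-length encoder in Scala that reproduces both examples above (the object and method names are just for illustration):

object RunLengthEncoding {
  // Replace each run of repeated characters with "<count><character>".
  def encode(s: String): String =
    if (s.isEmpty) ""
    else {
      val run = s.takeWhile(_ == s.head)
      s"${run.length}${s.head}" + encode(s.drop(run.length))
    }

  def main(args: Array[String]): Unit = {
    println(encode("AAAAABBBB"))  // 5A4B          -- shorter than the input
    println(encode("ABBAABAAB"))  // 1A2B2A1B2A1B  -- longer than the input
  }
}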

The most common kind of frequency compression is Huffman coding. The basic idea is to give each distinct value in a piece of data a code: more frequent values get shorter codes, and less frequent values get longer codes. Lossless compression is the typical approach for executables, text files, and spreadsheet files, where any loss of data would change the information.
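
A minimal sketch of that idea in Scala (not a production Huffman implementation; the names and the sample string are illustrative):

object HuffmanSketch {
  sealed trait Node { def freq: Int }
  case class Leaf(ch: Char, freq: Int) extends Node
  case class Branch(left: Node, right: Node, freq: Int) extends Node

  // Build the tree by repeatedly merging the two lowest-frequency nodes.
  def buildTree(text: String): Node = {
    var nodes: List[Node] =
      text.groupBy(identity).map { case (c, run) => Leaf(c, run.length) }.toList
    while (nodes.size > 1) {
      val sorted = nodes.sortBy(_.freq)
      nodes = Branch(sorted(0), sorted(1), sorted(0).freq + sorted(1).freq) :: sorted.drop(2)
    }
    nodes.head
  }

  // Walk the tree: a left edge appends "0" to the code, a right edge appends "1".
  def codes(node: Node, prefix: String = ""): Map[Char, String] = node match {
    case Leaf(c, _)      => Map(c -> (if (prefix.isEmpty) "0" else prefix))
    case Branch(l, r, _) => codes(l, prefix + "0") ++ codes(r, prefix + "1")
  }

  def main(args: Array[String]): Unit = {
    // 'A' appears most often, so it receives the shortest code.
    println(codes(buildTree("AAAAABBBC")))
  }
}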

Examples of lossless compression:

  • Archiving formats: Zip, GZip, bZip2, 7-Zip, etc.
  • Images/diagrams: GIF, PNG, PCX
  • Video: HEVC, MPEG-4 AVC

Lossy compression

Lossy compression permanently eliminates bits of data that are redundant, unimportant or imperceptible. Lossy compression is used for graphics, audio, video, and images, where the removal of some data bits has little or no discernible effect on the representation of the content.

A downside of this method is that quality degrades each time the same file is recompressed with a lossy codec.

Examples of lossy compression:

  • Images: JPEG
  • Audio: MP3, Windows Media Audio
  • Video: MPEG, DivX, Windows Media Video

Compression in Spark

import org.apache.spark.sql.SparkSession

object demo {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("Spark File Compression")
      .master("local")
      .getOrCreate()

    // Sample employee data used to compare the output of the different codecs.
    val empDF = spark.createDataFrame(Seq(
      (7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      (7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      (7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
      (7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
      (7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
      (7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      (7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      (7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      (7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      (7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
      (7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
    )).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno").cache()

    // Parquet output with different codecs; the "compression" option selects the codec.
    empDF.write.mode("overwrite").format("parquet").option("compression", "none").save("/tmp/file_no_compression_parq")
    empDF.write.mode("overwrite").format("parquet").option("compression", "gzip").save("/tmp/file_with_gzip_parq")
    empDF.write.mode("overwrite").format("parquet").option("compression", "snappy").save("/tmp/file_with_snappy_parq")
    // lzo requires a different method in terms of implementation (native LZO libraries must be installed).

    // ORC output with different codecs.
    empDF.write.mode("overwrite").format("orc").option("compression", "none").save("/tmp/file_no_compression_orc")
    empDF.write.mode("overwrite").format("orc").option("compression", "snappy").save("/tmp/file_with_snappy_orc")
    empDF.write.mode("overwrite").format("orc").option("compression", "zlib").save("/tmp/file_with_zlib_orc")
  }
}
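
To see what each codec actually saves, the size of each output directory can be compared. A minimal sketch that can be pasted into spark-shell or a Scala REPL (local filesystem only; the paths match the ones written above):

import java.io.File

// Sum the sizes of the files in an output directory.
def dirSize(path: String): Long =
  Option(new File(path).listFiles()).getOrElse(Array.empty[File]).map(_.length()).sum

Seq("/tmp/file_no_compression_parq", "/tmp/file_with_gzip_parq", "/tmp/file_with_snappy_parq")
  .foreach(p => println(s"$p -> ${dirSize(p)} bytes"))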

General Guidelines

  • You need to balance the processing capacity required to compress and uncompress the data, the disk IO required to read and write the data, and the network bandwidth required to send the data across the network. The correct balance of these factors depends upon the characteristics of your cluster and your data, as well as your usage patterns.
  • Compression is not recommended if your data is already compressed (such as images in JPEG format). In fact, the resulting file can actually be larger than the original.
  • GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO are a better choice for hot data, which is accessed frequently.
  • BZip2 can also produce more compression than GZip for some types of files, at the cost of some speed when compressing and decompressing. HBase does not support BZip2 compression.
  • Snappy often performs better than LZO. It is worth running tests to see if you detect a significant difference.
  • For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.
  • For MapReduce, you can compress either the intermediate data, the output, or both. Adjust the parameters you provide for the MapReduce job accordingly.
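
For reference, both cases are controlled by standard Hadoop job properties. A minimal sketch in Scala, assuming the Hadoop 2.x+ property names; the codec classes are just examples:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Compress the intermediate map output:
conf.setBoolean("mapreduce.map.output.compress", true)
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec")
// Compress the final job output:
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec")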
